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f-H , ABSTRACT. In the analysis of microarray data, and in some other con- 

temporary statistical problems, it is not uncommon to apply hypothesis tests 
in a highly simultaneous way. The number, N say, of tests used can be much 
larger than the sample sizes, n, to which the tests are applied, yet we wish 
to calibrate the tests so that the overall level of the simultaneous test is 
accurate. Often the sampling distribution is quite different for each test, 
so there may not be an opportunity for combining data across samples. In 
this setting, how large can N be, as a function of n, before level accuracy 
becomes poor? In the present paper we answer this question in cases where 
the statistic under test is of Student's t type. We show that if either Nor- 
mal or Student's t distribution is used for calibration then the level of the 
simultaneous test is accurate provided logiV increases at a strictly slower 
' rate than n 1 / 3 as n diverges. On the other hand, if bootstrap methods are 

used for calibration then we may choose logiV almost as large as n 1 / 2 and 
still achieve asymptotic level accuracy. The implications of these results are 
>- . explored both theoretically and numerically. 
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1. INTRODUCTION 

Modern technology allows us to collect a large amount of data in one 
scan of images. This is exemplified in genomic studies using microarrays, 
tiling arrays and proteomic techniques. In the analysis of microarray data, 
and in some other contemporary statistical problems, we often wish to make 
statistical inference simultaneously for all important parameters. The num- 
ber of parameters, N, is frequently much larger than sample size, n. Indeed, 
sample size is typically small; e.g. n = 8, 20 or 50 are considered to be 
typical, moderately large or large, respectively, for microarray data. The 
question arises naturally as to how large N can be before the accuracy of 
simultaneous statistical inference becomes poor. 

Important results in this direction have been obtained by van der Laan 
and Bryan (2001), who showed that the population mean and variance 
parameters can be consistently estimated when logiV = o(n) if observed 
data are bounded. Bickel and Levina (2004) gave similar results in a high- 
dimensional classification problem; Fan, Peng and Huang (2005) and Huang 
et al. (2005) studied semiparametric inference where N — > oo; and Hu 
and He (2006) proposed an enhanced quantile normalization based on high- 
dimensional singular value decomposition to reduce information loss in gene 
expression profiles. 

Korosok and Ma (2005) treated the problem of uniform, simultaneous es- 
timation of a large number of marginal distributions, showing that if log N = 
o(n), and if certain other conditions hold, then maxi<j<7v \\Fi — FiW^ — > 0, 
where Fi is an estimator of the ith marginal distribution Fj. As a corollary 
they proved that a P- value Pi of Fj converges uniformly in % to its counterpart 
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Pi for Fi, provided logiV = oin 1 / 2 ): 

max II Pj - PJloo -> 0. (1.1) 
l<i<iV 11 11 v ; 

These results are important advances in the literature of simultaneous test- 
ing, where P-values are popularly assumed to be known. For examples in 
the latter setting, see Benjamini and Yekutieli (2001), Dudoit, Shaffer and 
Boldrick (2003), Donoho and Jin (2004), Efron (2004), Genovese and Wasser- 
man (2001), Storey, Taylor and Siegmund (2004), Lehmann and Romano 
(2005), Lehmann, Romano and Shaffer (2005) where many new ideas have 
been introduced to control different aspects of simultaneous hypothesis test- 
ing and false discovery rate (FDR). 

In many practical settings the assumption that P-values are calculated 
without error is unrealistic, but it is unclear how good the approximation 
must be in order for simultaneous inference to be feasible. Simple consistency, 
as evidenced by (1.1), is not enough; the level of accuracy required must 
increase with N. More precisely, letting a at be the significant level, which 
tends to zero as N — > oo, the required accuracy is then 

max \\Pi - Pi\\oo = o(a N ). (1.2) 

l<i< N 

In this paper we provide a concise solution to this problem. 

For example, we show that in the case of simultaneous i- tests, cali- 
brated by reference to Normal or Student's t distributions, a necessary and 
sufficient condition for overall level accuracy to be asymptotically correct is 
(a) logiV = o(n 1 / 3 ). This is true even if the sampling distribution is highly 
skewed or heavy tailed. On the other hand, if bootstrap methods are used 
for estimating P-values then the asymptotic level of the simultaneous test is 
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accurate as long as (b) log N = oin 1 / 2 ). These results make clear the advan- 
tages offered by bootstrap calibration. We shall explore them numerically as 
well as theoretically. Result (a) needs only bounded third moments of the 
sampling distribution, although our proof of (b) needs more restrictions. 

Take the case of family-wise error rate as an example. If the overall 
error rate is controlled at p, then k n hypotheses with the smallest P-values 
are rejected, where 

k n = max{j : P { < p/N, i = 1, . . . , N} (1.3) 

and Pi denotes the significance level of the ith test. The Benjamini and 
Hochberg's (1995) approach to control false discovery rate (FDR) at p is to 
select 

k n = max{i : P {l) < ip/N, i = 1, . . . , N}, (1.4) 

where {P(i)} are the ordered values of {Pi}- If the distributions from which 
the P^s are computed need to be estimated then, in view of (1.3) or (1.4), 
the error of the estimators Pi should equal o(N~ 1 ) in order to sort correctly 
{Pi}, and the approximation (1.1) requires significant refinement. Indeed, it 
corresponds to (1.2) with oln = p/N . This is a very stringent requirement 
(«at = 10~ 5 , if p = 0.1 and N = 10 4 ) and the accuracy is hard to achieve 
for many practical situations. 

Sometimes, the classical approach provides an attractive alternative to 
select significant hypotheses (genes). For example, in their analysis of gene 
expression data, Fan et al. (2004) take a at = 0.001 and find the significant 
set of genes, 

S = {i:P t <a N ,i = l,...,N}, (1.5) 
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for N = 15,000 simultaneous tests. Here a at is an order of magnitude larger 
than iV -1 , and the approximation errors when estimating P- values need only 
be o(aAr) when computing (1.5), rather than o(N~ 1 ) in the family-wise error 
rate problem. In this case, the requirement is much less stringent and the 
number of simultaneous tests N for (1.2) to hold should be an order of 
magnitude larger than the case with a at = iV -1 . However, even if we take 
aN = C N~ a , where C > and a G (0, 1] are constants, from some points of 
view the problem does not become appreciably simpler. For instance, in the 
case of calibration by the Normal or Student's t distributions, condition (a) 
two paragraphs above is still necessary and sufficient. 

In this example the number of elements of 5, denoted by k' n , equals the 
number of genes discovered at the significance level ckjv; and Nccn, an upper 
bound to the expected number of falsely discovered genes, is approximately 
the same as the expected number of falsely discovered genes when most 
null hypotheses are true. Hence, the FDR in this example is estimated as 
p = Ncxn /k' n . Surprisingly, this classical approach is approximately the same 
as the Benjamini-Hochberg method. Using p as the FDR to be controlled, 
the Benjamini-Hochberg method picks up k n = max{z : < ip/N,i = 
1, . . . , N} most significant P-values, whereas the classical approach (1.5) se- 
lects k' n most significant P-values. The classical approach is more conserve 
since it is shown by Fan et al. (2005) that k' n < k n . However, in our simula- 
tion experiments (reported in a previous version), k n ~ k n . In other words, 
the classical approach selects nearly the same set of significant hypotheses 
as that by the Benjamini-Hochberg method. 

The results stated above are for the case of independent tests, but they 
also apply, in the sense of sufficiency for asymptotically conservative tests, 
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under the assumption of positive regression dependency; see Benjamini and 
Yekuteli (2001, p. 1170) for discussion. We discuss too the case where the 
errors in the data come from a linear process. There, theoretical formulae 
for significance levels can be based on Poisson cluster-process arguments. 
To treat simultaneous hypothesis testing under completely general types of 
dependence we use a Bonferroni argument. This permits our results on the 
accuracy of bootstrap and Student's t approximations to be extended to a 
wide range of settings. In particular they apply in the context of generalized 
family-wise error rate. 

Practical implications of our work include the following, (i) When cali- 
brating multiple hypothesis tests using Student's t statistic, for example with 
a view to controlling family-wise error rate, impressive level-accuracy can be 
obtained via Student's t approximation. Only a mild moment condition is 
needed, (ii) Nevertheless, accuracy is noticeably improved through using the 
bootstrap, (iii) These results also apply in the case of generalized family-wise 
error rate, (iv) Owing to the limitation of on the number of simultaneous 
hypotheses that can be accurately tested for a given n and a at, other meth- 
ods of pre-screening are needed when there are excessively many hypotheses 
to be tested. 

In a sequence of papers, Finner and Roters (1998, 1999, 2000, 2001, 
2002) developed theoretical properties of n simultaneous hypothesis tests 
as n increases. However, their work differs from ours in a major respect, 
through their assumption that the exact significance level can be tuned to a 
known value in the continuum. In such cases there is no theoretical limit to 
how large nN can be. By way of comparison, in the setting of the genetic 
problems that motivates our work, level inaccuracies limit the effective size 
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of N; we shall delineate this limitation using both theoretical and numerical 
arguments. 

The paper is organized as follows. In section 2 we formulate the accuracy 
problem for simultaneous tests. There, we also outline statistical models and 
testing procedures. Our main results are presented in section 3, where we 
answer the question of how many hypotheses can be tested simultaneously 
without the overall significance level being seriously in error. The theoretical 
definition of the latter property is that the overall significance level should 
converge to the nominal one as the number of tests increases, if the samples 
on which the tests are based are independent; and that the limiting level 
should not exceed the nominal one if the independence condition is violated 
and a Bonferroni bound is used. 

Section 4 outlines the idea of marginal aggregation when the number of 
hypotheses is ultra-large. Numerical investigations among various calibration 
methods are presented in section 5. Technical proofs of results in section 3 
are relegated to section 6. 

2. MODEL AND METHODS FOR TESTING 

2.1. Basic model and methodology. The simplest model is that where we 
observe random variables 

Yij = Hi + €ij , 1 < % < oo , 1 < j < n , (2.1) 

with the index i denoting the zth gene, j indicating the jth array, and the 
constant [ii representing the mean effect for the ith gene. We shall assume 
that: 

for each i, en, . . . , 6j n are independent and identically distributed random 
variables with zero expected value. 

(2.2) 
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The results given below are readily extended to the case where n = rti 
depends on i, but taking n fixed simplifies our discussion. 



For a given value of % we wish to test the null hypothesis H Qi that \ii = 0, 
against the alternative hypothesis Hn that (ii 7^ 0, for 1 < i < N say. 
We first study this classical testing problem of controlling the probability of 
making at least one false discovery, which requires calculating P-values with 
accuracy o(iV _1 ), the same as that needed in (1.3). We then extend our 
results to control the relaxed FDR in (1.5), which is less stringent. 

A standard test is to reject Hqi if \Ti\ > t a . Here, t a denotes the solution 
of either of equations 



P(\Z\>t a ) = l-{l-a) 1 ' N , P{|T(n-l)|>t a } = l-(l-a) 1 ^, (2.3) 



where N and T(k) have respectively the standard Normal distribution and 
Student's t distribution with k degrees of freedom. Note that (2.3) serves 
only to give a definition of t a that is commonly used in practice; it does 
not amount to an assumption about the sampling distribution of the data. 



argument in this paper is that the accuracy of the distributional approxima- 
tions implicit in (2.3) are based on a delicate relationship between n and N, 
which is central to the question of how many simultaneous tests are possible. 

2.2. Significance levels for simultaneous tests. If Hoi is true then the signifi- 
cance level of the test restricted to gene z, is given by 



Let Ti = n 1 / 2 Yi/ Si, where 




Indeed, oln = 1 — (1 — a) l / N and a is a one-to-one map. The core of the 



Pi = P 0i (\Ti\ > t a ) , 



(2.4) 
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where Poi denotes probability calculated under H^. For the classical ap- 
proach (1.3), which is nearly the same as the Benjamini-Hochberg method, 
we ask how large N can be so that 

max \pi — cxn\ = o(a7v)? (2.5) 

l<i<N 

The answer depends on the rate at which approaches to zero. In partic- 
ular, if aw = /3/N, (2.5) entails that 

N 

max pi = o(l) and > Pi = 3 + o(l) , (2.6) 

Ki<N 

— i=i 
for some < f3 < oo? Result (2.6) implies that the significance level of the 
simultaneous test, described in section 2.1, is 

a(N) = P^Hqi rejected for at least one i in the range 1 < % < ivj (2.7) 

N 

<X> = /3 + o(l). (2.8) 

i=i 

If, in addition to (2.2), we assume that 

the sets of variables {e^-, 1 < j < n} are independent for different i, 

(2.9) 

then 

N / N \ / N \ 

a(N) = l-IJ(l-Pi) = l-exp(-X;Pi)+0(Erf)- ( 2 - 10 ) 

i=l ^ i=l ' ^ i=l ' 

Consequently, (2.6) and (2.10) imply the following property: 

if (2.9) holds then a(N) = 1 - e"" + o(l) , (2.11) 

where a(N) is as defined at (2.7). The "o(l)" terms in (2.8) and (2.11) are 
quantities which converge to zero as N — > oo. Result (2.11) also holds, with 
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the identity a(N) = l-e _/3 + o(l) replaced by a (N) < l-e _/3 + o(l), if (2.9) 
is replaced by the positive regression dependency assumption (Benjamini and 
Yekuteli, 2001 p. 1170). 

In practice we would take (3 = — log(l — a), if we were prepared to 
assume (2.9) (or positive regression dependency of the test statistics) and 
wished to construct a simultaneous test with level close to a [or, respec- 
tively, asymptotically not exceeding a]; and take j3 = a, if we were using 
Bonferroni's bound to construct a conservative simultaneous test with the 
same approximate level. Further discussion of the dependent-data case is 
given in section 2.3. 

2.3. Generalized family-wise error rate. The results above can be generalized 
by extending the definition at (2.7) to, 

ak(N) = P^H Q i rejected for at least k values of i in the range 1 < i < N^j 
1 N 

< £ ^^ = /c" 1 /3 + o(l), (2.12) 

i=i 

where (2.12) follows from (2.6). We also have the following analogue of 
(2.11): Assuming (2.6): 

k 

if (2.9) holds then a k (N) = 1 - V ^ e' 13 + o(l) . (2.13) 

i=i J ' 

Under the positive regression dependency assumption, the equality in (2.13) 
would be replaced by <. 

Insight into properties of generalized family-wise error-rate in cases 
of dependency can be gained by considering settings where the processes 
Vj = > 1}, for j > 1, are independent and identically distributed 

moving averages, each with the distribution of V = > 1}, where 
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€i = J2k ® k $i+ki the variables Si are independent and identically distributed 
with zero mean, and the weights 9k may depend on n but satisfy J2 k 9\<C 
for a fixed constant C. Here, under conditions on the distribution of 5, it 
can be shown that if (2.6) holds then the limiting distribution of N, defined 
to equal the number of indices % for which |Tj| > t a , is that of Ylii<i<M Kg, 
where M has a Poisson distribution with mean A = f3/E(K), K\, K 2 , . . . are 
identically distributed as the nonnegative, integer- valued random variable K, 
and M, K\, K 2 , . . . are independent. Thus, the distribution of the number of 
counts is based on a Poisson cluster process, with K denoting cluster size. 

In this case, without the independence assumption (2.9), 

which property generalizes both (2.12) and (2.13). This result is not nec- 
essarily beneficial in practice, however, owing to the difficulty of estimating 
the distribution of K. 

2.4. Methods for calibration. For calibration against Normal or Student's t 
distributions we take the critical point t a to be the solution of the respec- 
tive equations (2.3). Below we consider bootstrap calibration; Edgeworth 
correction (see e.g. Hall, 1990) could also be used. 

Let Fj| , ... , Y} n denote a bootstrap resample drawn by sampling ran- 
domly, with replacement, from X = \Xi\-, • • • , Yin}- Put Y*- = — % and 
T* = n^Yf/St, where Y* = n" 1 £. Y* and (S*) 2 = n" 1 ^ {Y*~Y*)\ 
Write z a for the conventional Normal critical point for iV simultaneous tests. 
That is, z a solves the equation P(\Z\ > z a ) = 1 — (1 — a) l / N . (We could 
also use the Student's t point.) Define £ = fi(a) to be the solution of the 
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equation 

p(\T?\>zt\y i ) = i-(i- a )V N . 

Our bootstrap critical point is t ioL = Zf.( a y, we reject H 0i if and only if |Tj| > 
i ia . The definition of Pi at (2.4) should here be replaced by, 

Pl = P 0l (\T t \ >Ua). (2.14) 

With this new definition, (2.11), (2.12) and (2.13) continue to be a conse- 
quences of (2.6). 

3. THEORETICAL RESULTS 

3.1. Asymptotic results. Define Kis to be the third cumulant, or equivalently 
the skewness, of the distribution of e- = €u/(Eef 1 ) 1 / 2 . 

Theorem 3.1. Assume that 

max E\eA 3 = 0(1) (3.1) 
l<i<N v ' 

as N — > oo, and suppose too that N = N(n) — > oo in such a manner that 
(log N)/n 1 ^ — > 7, where < 7 < 00. Define t a by either of the formulae at 
(2.3), andp t by (2.4). Then (2.6) holds with 

= P(N) = - 1 ° g(1 7V ~ a) £ cosh (I 7 3 «is) , (3.2) 

where cosh(x) = (e x + e~ x )/2. 

The value of /3(N), defined at (3.2), is bounded by | log(l-a)| cosh(7 3 B), 
uniformly in N, where B = sup^ | /c^3 1 . 

Corollary 3.2. Assume the conditions of Theorem 3.1. If 7 = 0, i.e. if 

logiV = ©(n 1 / 3 ), then (2.6) holds with f3 = — log(l — a); and if 7 > then 
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(2.6) holds with (3 = — log(l — a) if and only if iV _1 J2i<N \ Ki ^\ ~^ 0' J - e - ^ 
and only if the limit of the average absolute values of the skewnesses of the 
distributions of en, . . . , 6ni equals zero. 

Since Corollary 3.2 implies (2.6) then it also entails (2.8) and (2.11)-(2.13). 

Theorem 3.3. Strengthen (3.1) to the assumption that for a constant C > 
0, -P(|e-| < C) = 1, and suppose too that N = N(n) — > oo in such a manner 
that logiV = oin 1 / 2 ). Define ii a = Zf.( a y as in section 2.4, and define pi 
by (2.14). Then (2.6) holds with (3 = - log(l - a). 

3.2. Applications to controlling error rate. Define t a and U a by (2.3) and 
as in section 2.4, respectively. In the proof of Theorem 3.1 it is shown 
that, with (3 = — log(l — a) and using conventional calibration, Poid^il > 
t a ) = 13 N~ x + o(N~ 1 ), uniformly in i under the null hypotheses, provided 
logiV = o(n 1 / 3 ); and that, when employing bootstrap calibration, P 0i (\Ti\ > 
ha) = (3 N~ l + o(N~ 1 ) : again uniformly in z, if logiV = ©(n 1 / 2 ). These 
results substantially improve a uniform convergence property of Kosorok 
and Ma (2005), at the expense of more restrictions on N. 

When the P- values in (1.3) need to be estimated, the estimation errors 
should be of order o(iV _1 ), where iV diverges with n. On the other hand, 
when P-values in (1.5) are estimated, the precision can be of order o(ajv), 
where for definiteness we shall take oln = C N~ a with C > and a G (0, 1]. 
In the latter case the results in Theorems 3.1 and 3.3 continue to apply; there 
is no relaxation, despite the potential simplicity of the problem. 

To appreciate why, note that the tail probability of the standard Normal 
distribution satisfies P(\Z\ > x) ~ exp(— x 2 /2) / [\/2tt x). Suppose that the 
large deviation result holds up to the point x = x n , which should be of order 
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o(n 1 / 6 ) for Student's t calibration, and oin 1 ^) for bootstrap calibration. 
Setting P(\Z\ > x) equal to ajy yields log(l/ajv) = \ %n + logx n + . . ., that 
is, 

a log N = \x 2 n + logx n + log(v^7rC) + smaller order terms . 

Regardless of the values of C > and a G (0, 1], this relation implies that the 
condition x n = o(n 1 ^ 6 ) is equivalent to logiV = o(n 1 ^ 3 ) and x n = o(n 1 ^ 4 ) 
entails log N = o(n 1 ^ 2 ), although taking a close to will numerically improve 
approximations, in both theory and practice. 

4. MARGINAL AGGREGATION 

We have shown that with bootstrap calibration, Student's t-statistics 
can test simultaneously a number of hypotheses of order exp{o(n 1 / 2 )}. Al- 
though this value may be conservative, it may still not be large enough for 
some applications to microarray and tiling arrays where the number of si- 
multaneous tests can be even larger. Similarly, whether we consult a t-table 
or a Normal table depends very much on mathematical assumptions. For 
example, suppose an observed value of a t statistic is 6.7. Its corresponding 
two-tail P- value, for n = 6 arrays, is 0.112% when looking up t-tables with 
five degrees of freedom, and 2.084 x 10 -11 when consulting Normal tables. 
Yet, both distributions can be regarded as the approximate ones. 

To overcome these problems, Reiner et al. (2003) and Fan et al. (2004) 
introduce marginal aggregation methods. The basic assumption is that the 
null distributions of test statistics Tj are the same, denoted by F. With this 
assumption (which also implicitly assumed when the same distribution table 
is looked up), we can use the empirical distribution of {Tj, i = 1, . . . , iV}, i.e. 

JV 

F N (x) = N- 1 ^I(Ti<x), (4.1) 

i=i 
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as an estimator of F. This turns the "curse-of-dimensionality" into a "blessing- 
of-dimensionality" . In this case, even if n is fixed, the distribution of F can 
still be consistently estimated when N — > oo and only a small fraction of 
alternative hypotheses are true. 

Reiner et al. (2003) and Fan et al. (2004) carry this idea one step further. 
They aggregate the estimated distributions for each T^, based on a resampling 
technique (more precisely, a permutation method). For example we can use, 
as an estimator of F(x), the average of bootstrap estimators, 

JV 

F* N (x) = N- 1 J2P(T*<x\y i ). (4.2) 

i=i 

The theorem below describes properties of F N and F^, defined at (4.1) 
and (4.2), respectively. Interpret x n , below, as a left-tail critical value; sim- 
ilar results also hold in the upper tail. 

Theorem 4.1. Let N\ be the number of nonzero i.e. the number of 
elements of the set {i : Hu is true}, and put F n (x) = E{P(T* < x)}. Then, 

F N (x n ) = F(x n ) + O p [(N t /N) + s/F{x n )/N {l + a n N + a n N 1 F{x n )-^ 2 } 1/2 
F* N (x n ) = F n (x n ) + O p {v^P} , 

provided that \rij\ < a n , with denoting the correlation coefficient between 
I(T{ < x n ) and I(Tj < x n ). 

The term 0{N\/N) in (4.3) reflects the bias of the estimate. It can be 
reduced by using the sieve idea of Fan et al. (2005). 

5. NUMERICAL PROPERTIES 

In our simulation study we construct models that reflect aspects of gene 
expression data. To this end, we divide genes into three groups. Within 
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each group, genes share one unobserved common factor with different factor 
loadings. In addition, there is an unobserved common factor among all the 
genes across the three groups. For simplicity of presentation, we assume that 
N is a multiple of three. We denote by Zij a sequence of independent N(0, 1) 
random variables, and Xij a sequence of independent random variables of 
the same distribution as that of (Xm — m) / \Jlm. Note that Xij has mean 0, 
variance 1 and skewness ^/8/m. In our simulation study we set m = 6. 

With given factor loading coefficients ai and hi, the error eij in (2.1) is 
defined as 



var(eij ) = 1, and that the within-group correlation is in general stronger than 
the between-group correlation, since the former shares one extra common 
factor. We consider two specific choices of factor loadings: Case I, where the 
factor loadings are taken to be cij = 0.25 and bj = 0.1 for all j (thus, the 
ejj's have the same marginal distribution, although they are correlated); and 
Case II, for which the factor loadings and hi are generated independently 
from, respectively, U(0, 0.4) and £7(0,0.2). 

The "true gene expression" levels \ii are taken from a realization of 
the mixture of a point mass at and a double-exponential distribution: 
c $o + \ (1 — c) exp(— \x\), where c G (0, 1) is a constant. With the noise 
and the expression level given above, generated from (2.1) represents, for 
each fixed j, the observed log-ratios between the two-channel outputs of a 
c-DNA microarray. Note that \fij\ > log 2 means that the true expression 




i = l,...,N, j 



where a^- = except that an = for i = 1, . . . , 3 N, = at for i 
1, . . . , I TV, and a i3 = ai with i = | N + 1, . . . , N. Note that Ee^ 



~3 N + 



and 
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ratio exceeds 2. The probability, or the empirical fraction, of this event 
equals \ (1 — c). 

For each a at we compute the P- value according to the Normal approx- 
imation, t-approximation, the bootstrap method and the aggregated boot- 
strap (4.2). This results in N estimated P-values Pj for each method and 
each simulation. Let N\ denote the number of P-values that are no larger 
than aN', see (1.5). Then, Ni/N is the empirical fraction of null hypothe- 
ses that are rejected. When c = 1, Ni/(NotN) — 1 reflects the accuracy 
with which we approximate P-values. Its root mean square error (RMSE), 
{E(Ni/(N(Xn) — l) 2 } 1 / 2 , will be reported, where the expectations are ap- 
proximated by averages across simulations. We exclude the marginal aggre- 
gation method (4.1), since in our simulations it always gave Ni/N = ccn- 

We take N = 600 (small), N = 1, 800 (moderate) and N = 6, 000 (typi- 
cal) for microarray applications (after preprocessing, which filters out many 
low quality measurements on certain genes) and = 1.5iV -2 / 3 , resulting 
in ct 7v = 0.02, 0.01 and 0.005, respectively. The sample size n is taken to be 6 
(typical number of microarrays) , 20 (moderate) and 50 (large). The number 
of replications in simulations is 600,000/iV. For the bootstrap calibration 
method and the aggregated method (4.2), we replicate bootstrap samples 
2,000, 4,000 and 9,000 times, respectively, for a N = 0.02, 0.01 and 0.005. 

Tables 1 and 2 report the accuracy of estimated P-values when c = 1. 
It can be seen that the Normal approximations are too inaccurate to be use- 
ful. Therefore we shall exclude the Normal method in the discussion below. 
For n = 20 and 50, the bootstrap method provides better approximations 
than Student's t-method. This indicates that the bootstrap can test more 
hypotheses simultaneously, which is in accord with our asymptotic theory. 



18 

Overall the bootstrap method is also slightly better then the aggregated 
bootstrap (4.2), although the two methods are effectively comparable. How- 
ever, with the small sample size n = 6, Student's t-method is relatively the 
best, although the approximations are poor in general. This is understand- 
able, as the noise distribution is not Normal. With such a small sample size, 
the two bootstrap-based methods, in particular the aggregated bootstrap 
method (4.2), suffer more from random fluctuation in the original samples. 

We now illustrate the four methods by using a microarray data set. The 
data were analyzed by Fan et al. (2004), where the biological aim was to 
examine the impact of the stimulation by MIF, a growth factor, on the ex- 
pressions of genes in neuroblastoma cells. Six arrays of cDNA microarray 
data were collected, consisting of relative expression profiles of 19,968 genes 
in the MIF stimulated neuroblastoma cells (treatment) and those without 
stimulation (control). After preprocessing that filtered low quality expres- 
sion profiles, 15,266 genes remained. Within-array normalization, discussed 
by Fan et al. (2004), was applied to remove the intensity effect and block 
effect. Among 15,266 gene expression profiles, for simplicity of illustration 
we focussed only on 7,583 genes that do not have any missing values. In our 
notation, N = 7,583 and n = 6. Table 3 summarizes the results at different 
levels of significance. 

Different methods for estimating P-values yield very different results. In 
particular, the distribution of P-values computed by looking up the normal 
table is stochastically much larger than that based on the t-table, which in 
turn is stochastically much larger than that based on the bootstrap method. 
The results are very different. As noted before, the normal approximation 
is usually very poor and grossly inflates the number of significant genes. 
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The same remark applies to the t-approximation with moderate or large 
numbers of degrees of freedom. The most accurate approximation is given 
by bootstrap methods. 



For the sake of brevity we shall derive only Theorems 3.1 and 3.3. Let 
C\ > 0. Given a random variable X with E(X) = 0, consider the condition: 



The following result follows from Theorem 1.2 of Wang (2005), after trans- 
forming the distribution of T to that of (£\ X i )/(^2 i Xf). 

Theorem 6.1. Let X, Xi,X 2 ,... denote independent and identically dis- 
tributed random variables such that (6.1) holds. Write T = T(n) for Stu- 
dent's t statistic computed from the sample Xi, . . . , X n , with (for the sake 
of dehniteness) divisor n rather than n — 1 used for the variance. Put tvs = 
— | Ks, where Ks denotes the skewness of the distribution of X/^arX) 1 / 2 . 



where 9 = 9(x, n) satisfies \6{n, x) \ < C 2 uniformly in < x < C%n 1//4 and 
n > 1, and C2, C3 > depend only on C\. 

Theorem 3.1 in the case of Normal calibration follows directly from 
Theorem 6.1. The case of Student's t calibration can be treated similarly. 

To derive Theorem 3.3, note that each var(e' i ) = 1. To check that, 
with probability at least p n = 1 — exp(— d\ n 1 / 2 ) for a constant d\ > 0, 
the conditions of Theorem 6.1 hold for the bootstrap distribution of the 



6. PROOFS OF RESULTS IN SECTION 3 



E(X) = 0, E(X 2 )=1, E(X 4 )<d. 



(6.1) 



Then, 




(6.2) 
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statistic T*, for each 1 < i < N, it suffices to show that there exist constants 

1 /2 

< C 4 < C 5 ' such that, with probability at least p n , the following condition 
holds for 1 < i < N: 

1 n 1 n 

c 4 < - v; (y y - Yj) 2 , - E - F *) 4 ^ c * ■ ( 6 - 3 ) 

This can be done using Bernstein's inequality and the assumption that, for 
each i, P(|eJ| < C) = 1, and can also be shown by the uniform convergence 
result of the empirical process of Korosok and Ma (2005). 

Let S n denote the event that (6.3) holds for each 1 < % < N. When £ n 
prevails, we may apply Theorem 6.1 to the distribution of T* conditional on 
[Vj, obtaining: 

P{T* > x | y t ) = {1 - exp ( - I k i3 n- 1 ' 2 x 3 ) |l + § t , 

(6.4) 

where ki3 is the empirical version of ^3, computed from y^ and, on an event 
of which the probability equals 1— (9{exp(— c?2 n 1 ^ 2 )}, \6i\ < Di uniformly in i 
and in < x < x n . (Here and below, x n will denote any sequence diverging 
to infinity but satisfying x n = o(n 1 / 4 ), and Di, D 2 , . . . and di,d2, ••• will 
denote constants.) It follows directly from Theorem 6.1 that 

Po.iT, > x) = {1 - exp ( - § « i3 n- 1 ' 2 x 3 ) |l + 6 } , (6.5) 

where \9i\ < D\ uniformly in % and in < x < x n . 

Result (6.5), and its analogue for the left-hand tail of the distribution 
of Tj, allow us to express t ia , the solution of the equation P i(|^i| > Ua) = 

1 — (1 — a) 1 /^, as a Taylor expansion: 

\U a - z a -c K i3 n~ 1/2 z 2 a \ < D 2 {n~ l z* + n~ 1/2 z a ) , 
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uniformly in i, where c is a constant and z a is the solution of P(\Z\ > z a ) = 
1 — (1 — a) x / N . Note that if z a solves this equation then z a ~ (2 logiV) 1 / 2 , 
and so, since logiV = o(n 1 / 2 ), then z a = o(n 1 / 4 ). Therefore, without loss of 
generality, < z a < x n . Likewise we may assume below that < ti a < x n , 
and < ti a < %n with probability 1 — 0{exp(— <i 2 n 1 / 2 )}. 

Also, from (6.4) we can see that on an event of which the probability 
equals 1 — (3{exp( — d 2 n 1 / 2 )}, 

\iia - z a - ck i3 n~ 1/2 z a \ < D 3 (n -1 z a + n~ 1/2 z a ) . 

However, on an event with probability 1 — 0{exp(— <i 3 n 1 / 2 )}, |£j 3 — K i3 | < 
D4 n -1 / 4 , and therefore, on an event with probability 1 — (3{exp( — n 1 / 2 )}, 

\iux ~z a -c K i3 n~ 1/2 zl\<D 5 (n' 1 z a + n' 1 / 2 z a + n" 3/4 z 2 a ) . 

It follows from the above results that Poid^il > Ua) lies between the 
respective values of 

Poi(\Ti\ > U a ± 5) T As exp ( - rf 4 n 1/2 ) , (6.6) 

where 

5 = D 5 (n-^ + n-^Za+n-Wg). 

Using (6.5), and its analogue for the left-hand tail, to expand the probability 
in (6.6), we deduce that 

P 0l (\T l \ > t ia ±6) = P Ql (\T l \ > t^) {1 + o(l)} , 

uniformly in i. More simply, exp(— n 1 ! 2 ) = o{Poi(\Ti\ > tia,)}, using the 
fact that z a = o(n 1 ^ 4 ) and exp(— D7 z 2 a ) = o{Poi(|2"i| > U a )} for sufficiently 
large D 7 > 0. Hence, 

P 0l (\Ti\ > Ua) = P 0l (\T l \ > t la ) {1 + 0(1)} , 
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uniformly in i. Theorem 3.3 follows from this property. 
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